R basics & time series datasets

1 Introduction

1.1 About

1.2 Structure of the crash course

  1. Introduction to R basics & datasets + summary statistics
  2. Theory 1: randomness, stationarity, unit roots, random walk
  3. Theory 2: autocorrelation, ARMA models, predictive regressions
  4. Advanced models (VAR, GARCH and neural networks) and prediction

2 Packages

There are currently more than 23,000 packages on CRAN.
Many more exist on GitHub, but those are not vetted/verified.

2.1 Installation

Just run the following chunk:

options(timeout = 6000)   # Extend the download timeout, useful on slow connections
if(!require("openxlsx")){ # If one package is missing, install the whole list
  install.packages(c("tidyverse", "openxlsx", "reticulate", "quantmod", "feasts",
                     "tsibble", "fable", "crypto2", "ggsci", "WDI"))
}

2.2 Loading

library() is the equivalent of “import …” in Python.

library(tidyverse)  # THE library for data science
library(openxlsx)   # A cool package to deal with Excel files/formats
library(quantmod)   # Package for financial data extraction
library(tsibble)    # TS with dataframes framework 
library(fable)      # Package for time-series models & predictions
library(feasts)     # Package for time-series analysis
library(crypto2)    # Package to access crypto data
library(ggsci)      # For cool plot palettes & colors
library(WDI)        # For World Bank data

3 Things to know with R

3.1 Data structures, assigning & indexing

Arrows go both ways (<- and ->). The equal sign “=” only assigns to the left!

a <- 6
a
[1] 6
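To make the directions explicit, here is a small sketch (the object name b is ours, for illustration):

```r
b <- 6   # right-to-left assignment
6 -> b   # left-to-right assignment: arrows go both ways
b = 6    # works, but only assigns to the left
# 6 = b  would throw an error: "=" does not go both ways
b
# [1] 6
```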

Vectors of integers…

1:12
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
12:1
 [1] 12 11 10  9  8  7  6  5  4  3  2  1

Sequences of equally-spaced numbers.

seq(0, 1, by = 0.1)
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
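seq() can also be driven by the number of points instead of the step size (a quick variant):

```r
seq(0, 1, length.out = 5)   # 5 equally-spaced points between 0 and 1
# [1] 0.00 0.25 0.50 0.75 1.00
```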

Custom vectors. This syntax is strange (compared to Python/MATLAB), but super IMPORTANT in R.
Essentially, you wrap the elements in the simple c() (combine) function.

c(2, 3, 5, 8, 13)
[1]  2  3  5  8 13
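Note that c() also concatenates existing vectors, which is handy when building inputs piece by piece:

```r
c(1:3, c(10, 20))   # concatenate a sequence and a custom vector
# [1]  1  2  3 10 20
```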

Matrices.

First method: stacking vectors (or smaller matrices).

rbind(1:5, 2:6)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    3    4    5    6
cbind(1:5, 2:6)
     [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    3    4
[4,]    4    5
[5,]    5    6

Second method: filling with values.

matrix(1:12, nrow = 3, byrow = T)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

Indexing. It starts at one!

M <- matrix(1:72, nrow = 6)
M
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,]    1    7   13   19   25   31   37   43   49    55    61    67
[2,]    2    8   14   20   26   32   38   44   50    56    62    68
[3,]    3    9   15   21   27   33   39   45   51    57    63    69
[4,]    4   10   16   22   28   34   40   46   52    58    64    70
[5,]    5   11   17   23   29   35   41   47   53    59    65    71
[6,]    6   12   18   24   30   36   42   48   54    60    66    72
M[2:5, 2:5] # rows first, then columns, ALWAYS!
     [,1] [,2] [,3] [,4]
[1,]    8   14   20   26
[2,]    9   15   21   27
[3,]   10   16   22   28
[4,]   11   17   23   29
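One R-specific convenience worth knowing (a quick sketch): negative indices drop rows or columns instead of selecting them.

```r
M <- matrix(1:72, nrow = 6)
M[-(2:6), ]   # drop rows 2 to 6: only the first row remains
#  [1]  1  7 13 19 25 31 37 43 49 55 61 67
```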

Dataframes: we can build them column-by-column. The tidyverse relies on an enhanced version of them called “tibbles”.
Most of the time, they are simply imported from an external source (e.g., an Excel spreadsheet).
Below, we build one from scratch (this is often useful).

n <- 12
data.frame(t = 1:n, x = rnorm(n), y = log(1:n))
    t          x         y
1   1  1.3095721 0.0000000
2   2 -0.4957079 0.6931472
3   3  0.1053022 1.0986123
4   4  0.1794539 1.3862944
5   5  0.0283460 1.6094379
6   6 -1.2470486 1.7917595
7   7 -0.5993684 1.9459101
8   8 -0.0833312 2.0794415
9   9 -1.2958980 2.1972246
10 10  0.1116494 2.3025851
11 11  0.6284734 2.3978953
12 12 -0.9536080 2.4849066

3.2 From type to type

Number → character.

as.character(4)
[1] "4"

Character → number.

as.numeric("4.5")
[1] 4.5

Text → date. Note that R prints dates in the ISO-8601 format (YYYY-MM-DD).

as.Date("13/02/2005", format = "%d/%m/%Y")
[1] "2005-02-13"

Dates again.

as.Date("09/11/01", format = "%m/%d/%y")
[1] "2001-09-11"

Factors.

as.factor(c("Large", "Medium", "Small", "Small", "Medium", "Large"))
[1] Large  Medium Small  Small  Medium Large 
Levels: Large Medium Small
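By default, the levels are sorted alphabetically (“Large” before “Medium” before “Small”). When the order matters (e.g., for plots), set it explicitly:

```r
factor(c("Large", "Medium", "Small"), levels = c("Small", "Medium", "Large"))
# [1] Large  Medium Small 
# Levels: Small Medium Large
```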

3.3 The tidyverse + piping

The tidyverse is an ecosystem of packages that are incredibly useful in data science tasks.
In particular: {dplyr} for data manipulation, {ggplot2} for plotting, and {tidyr} for reshaping data.

For this section, we need data. Below, we import macroeconomic information from the World Bank API.
Technically speaking, this data is of panel type, but it’s great to work with.

wb_data <- WDI(                             # World Bank data
  indicator = c(
    "pop" = "SP.POP.TOTL",                  # Population
    "pop_growth" = "SP.POP.GROW",           # Population growth
    "gdp_percap" = "NY.GDP.PCAP.CD",        # GDP per capita
    "gdp" = "NY.GDP.MKTP.CD",               # Gross Domestic Product (GDP)
    "R_D" = "GB.XPD.RSDV.GD.ZS",            # R&D (%GDP)
    "high_tech_exp" = "TX.VAL.TECH.MF.ZS",  # High tech exports (%)
    "inflation" = "FP.CPI.TOTL.ZG",         # Inflation rate
    "educ_spending" = "SE.XPD.TOTL.GD.ZS"   # Education spending (%GDP)
  ), 
  extra = TRUE,
  start = 1960,
  end = 2024) # |> filter(lastupdated == max(lastupdated))

Filtering observations ( = operate on rows).

filter(wb_data[,1:8], year > 2012, country == "India")
country iso2c iso3c year status lastupdated pop pop_growth
India IN IND 2014 2025-10-07 1312277191 1.2612903
India IN IND 2015 2025-10-07 1328024498 1.1928556
India IN IND 2013 2025-10-07 1295829511 1.3327043
India IN IND 2016 2025-10-07 1343944296 1.1916297
India IN IND 2017 2025-10-07 1359657400 1.1623961
India IN IND 2024 2025-10-07 1450935791 0.8907065
India IN IND 2018 2025-10-07 1374659064 1.0972991
India IN IND 2023 2025-10-07 1438069596 0.8832895
India IN IND 2022 2025-10-07 1425423212 0.7902005
India IN IND 2021 2025-10-07 1414203896 0.8226482
India IN IND 2020 2025-10-07 1402617695 0.9734386
India IN IND 2019 2025-10-07 1389030312 1.0400140

Ok so we see the ordering is a mess (look at the years)…
Let’s sort that out.

wb_data <- arrange(wb_data, country, year)
tail(wb_data, 5) 
country iso2c iso3c year status lastupdated pop pop_growth gdp_percap gdp R_D high_tech_exp inflation educ_spending region capital longitude latitude income lending
17286 Zimbabwe ZW ZWE 2020 2025-10-07 15526888 1.659353 1730.454 26868564055 NA 2.383868 557.20182 NA Sub-Saharan Africa Harare 31.0672 -17.8312 Lower middle income Blend
17287 Zimbabwe ZW ZWE 2021 2025-10-07 15797210 1.726011 1724.387 27240507842 NA 1.053740 98.54611 NA Sub-Saharan Africa Harare 31.0672 -17.8312 Lower middle income Blend
17288 Zimbabwe ZW ZWE 2022 2025-10-07 16069056 1.706209 2040.547 32789657378 NA 1.476931 104.70517 NA Sub-Saharan Africa Harare 31.0672 -17.8312 Lower middle income Blend
17289 Zimbabwe ZW ZWE 2023 2025-10-07 16340822 1.677096 2156.034 35231369343 NA 1.942188 NA 0.3847713 Sub-Saharan Africa Harare 31.0672 -17.8312 Lower middle income Blend
17290 Zimbabwe ZW ZWE 2024 2025-10-07 16634373 1.780482 2656.409 44187704410 NA NA NA NA Sub-Saharan Africa Harare 31.0672 -17.8312 Lower middle income Blend

For descending sorting:

arrange(wb_data, desc(country), year) |> head(7)
country iso2c iso3c year status lastupdated pop pop_growth gdp_percap gdp R_D high_tech_exp inflation educ_spending region capital longitude latitude income lending
Zimbabwe ZW ZWE 1960 2025-10-07 3809389 NA 276.4198 1052990485 NA NA NA NA Sub-Saharan Africa Harare 31.0672 -17.8312 Lower middle income Blend
Zimbabwe ZW ZWE 1961 2025-10-07 3930401 3.127265 279.0165 1096646688 NA NA NA NA Sub-Saharan Africa Harare 31.0672 -17.8312 Lower middle income Blend
Zimbabwe ZW ZWE 1962 2025-10-07 4055959 3.144570 275.5456 1117601690 NA NA NA NA Sub-Saharan Africa Harare 31.0672 -17.8312 Lower middle income Blend
Zimbabwe ZW ZWE 1963 2025-10-07 4185877 3.152908 277.0057 1159511793 NA NA NA NA Sub-Saharan Africa Harare 31.0672 -17.8312 Lower middle income Blend
Zimbabwe ZW ZWE 1964 2025-10-07 4320006 3.154055 281.7445 1217138098 NA NA NA NA Sub-Saharan Africa Harare 31.0672 -17.8312 Lower middle income Blend
Zimbabwe ZW ZWE 1965 2025-10-07 4458462 3.154707 294.1454 1311435906 NA NA NA NA Sub-Saharan Africa Harare 31.0672 -17.8312 Lower middle income Blend
Zimbabwe ZW ZWE 1966 2025-10-07 4601217 3.151697 278.5675 1281749603 NA NA NA NA Sub-Saharan Africa Harare 31.0672 -17.8312 Lower middle income Blend

Before we continue, we want to introduce a wonder of coding efficiency: THE PIPE OPERATOR.
In fact, there are two of them and they are pretty similar: %>% (from {magrittr}) and |> (native to base R since version 4.1).
The idea is to chain operations: the outcome of what comes before the pipe becomes the input of what comes after it.
We replicate the filter code above, but via piping.

wb_data |> filter(country == "India", year > 2020)
country iso2c iso3c year status lastupdated pop pop_growth gdp_percap gdp R_D high_tech_exp inflation educ_spending region capital longitude latitude income lending
India IN IND 2021 2025-10-07 1414203896 0.8226482 2239.614 3.167271e+12 NA 10.21256 5.131407 4.629500 South Asia New Delhi 77.225 28.6353 Lower middle income IBRD
India IN IND 2022 2025-10-07 1425423212 0.7902005 2347.448 3.346107e+12 NA 12.68228 6.699034 4.098658 South Asia New Delhi 77.225 28.6353 Lower middle income IBRD
India IN IND 2023 2025-10-07 1438069596 0.8832895 2530.120 3.638489e+12 NA 14.93435 5.649143 NA South Asia New Delhi 77.225 28.6353 Lower middle income IBRD
India IN IND 2024 2025-10-07 1450935791 0.8907065 2696.664 3.912686e+12 NA NA 4.953036 NA South Asia New Delhi 77.225 28.6353 Lower middle income IBRD

And indeed, the output confirms that the data is now sorted (the years are in increasing order).
Henceforth, we will always resort to piping…
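To see why pipes help readability, compare a nested call with its piped equivalent (a toy example with base R functions):

```r
round(mean(log(1:10)), 2)            # nested: read from the inside out
1:10 |> log() |> mean() |> round(2)  # piped: read from left to right
# Both return 1.51
```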

Select columns ( = operate on columns).

wb_data |> select(country, year, pop_growth, gdp) |> tail()
country year pop_growth gdp
17285 Zimbabwe 2019 1.563533 25715657177
17286 Zimbabwe 2020 1.659353 26868564055
17287 Zimbabwe 2021 1.726011 27240507842
17288 Zimbabwe 2022 1.706209 32789657378
17289 Zimbabwe 2023 1.677096 35231369343
17290 Zimbabwe 2024 1.780482 44187704410

Create new columns with mutate().

wb_data |> 
  mutate(total_educ_spending = educ_spending * gdp) |>
  select(country, year, total_educ_spending) |>
  filter(is.finite(total_educ_spending)) |>
  tail()
country year total_educ_spending
6336 Zimbabwe 2012 103890204643
6337 Zimbabwe 2013 114469274331
6338 Zimbabwe 2014 119670494331
6339 Zimbabwe 2017 102322600059
6340 Zimbabwe 2018 70036650843
6341 Zimbabwe 2023 13556019685

3.4 Tidy vs messy data

Tidy data seems simple:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.

But it’s not always easy to spot or understand at first.
A counter-example can help:

data.frame(Year = c(1970, 1990, 2010),
           France = c(52, 59, 65),
           Germany = c(61, 80, 82),
           UK = c(56, 57, 63))
Year France Germany UK
1970 52 61 56
1990 59 80 57
2010 65 82 63

“France” is not a variable name. It’s a value for the variable “country”…
Luckily, we have a tool to turn a dataset from messy to tidy: pivot_longer().
All you need to do is:

  1. Determine which columns (variables) to pivot;
  2. Pick a name for the new column that will feature the column names;
  3. Choose another name for the new column that will store the values.

An example below.

data.frame(Year = c(1970, 1990, 2010),
           France = c(52, 59, 65),
           Germany = c(61, 80, 82),
           UK = c(56, 57, 63)) |>
  pivot_longer(-Year, names_to = "Country", values_to = "Population")
Year Country Population
1970 France 52
1970 Germany 61
1970 UK 56
1990 France 59
1990 Germany 80
1990 UK 57
2010 France 65
2010 Germany 82
2010 UK 63

Same information, but structured/presented differently.
What happened? Each (Year, Country) pair now occupies its own row.

Another example on a transposed version (countries in rows and years in columns):
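A sketch of what this could look like (the transposed frame below is rebuilt by hand from the same numbers): this time, the year columns are the ones that get pivoted.

```r
# (tidyr is loaded above via library(tidyverse))
data.frame(Country = c("France", "Germany", "UK"),
           `1970` = c(52, 61, 56),
           `1990` = c(59, 80, 57),
           `2010` = c(65, 82, 63),
           check.names = FALSE) |>                 # keep "1970" etc. as column names
  pivot_longer(-Country, names_to = "Year", values_to = "Population")
```

Note that the new Year column is stored as character; add names_transform = list(Year = as.integer) inside pivot_longer() if you need numbers.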

3.5 Pivot tables

These beasts can be incredibly useful. They require two steps:

  1. Determine the dimensions (categorical columns) along which to perform the analysis: this is done with group_by();
  2. Define the key metrics to compute: this is done with summarize().

wb_data |> 
  filter(year == 2023, region != "Aggregates") |>
  group_by(region) |>
  summarise(total_population = sum(pop))
region total_population
East Asia & Pacific 2361106057
Europe & Central Asia 925640799
Latin America & Caribbean 654413871
Middle East & North Africa 508842198
North America 376954413
South Asia 1951539835
Sub-Saharan Africa 1241543738

Below, “na.rm = T” (short for na.rm = TRUE) means that NA values will be removed before applying the function.

wb_data |>
  filter(region != "Aggregates") |>
  group_by(region, year) |>
  summarise(avg_gdp_percap = mean(gdp_percap, na.rm = T)) |>
  head(8)
region year avg_gdp_percap
East Asia & Pacific 1960 522.1182
East Asia & Pacific 1961 535.4375
East Asia & Pacific 1962 546.7194
East Asia & Pacific 1963 594.2096
East Asia & Pacific 1964 636.9282
East Asia & Pacific 1965 785.9105
East Asia & Pacific 1966 839.3606
East Asia & Pacific 1967 795.2839

A last example.

wb_data |>
  group_by(income) |>
  summarise(avg_gdp_percap = mean(gdp_percap, na.rm = T))
income avg_gdp_percap
Aggregates 5285.0354
High income 22472.9538
Low income 433.7686
Lower middle income 1150.8568
Not classified 4101.0849
Upper middle income 3475.4895
NA 4405.7019

What is a low income country?

wb_data |> 
  filter(income == "Low income") |>
  select(country) |>
  distinct()
country
Afghanistan
Burkina Faso
Burundi
Central African Republic
Chad
Congo, Dem. Rep.
Eritrea
Ethiopia
Gambia, The
Guinea-Bissau
Korea, Dem. People’s Rep.
Liberia
Madagascar
Malawi
Mali
Mozambique
Niger
Rwanda
Sierra Leone
South Sudan
Sudan
Syrian Arab Republic
Togo
Uganda
Yemen, Rep.

3.6 Plots

In “ggplot”, the GG stands for grammar of graphics, a very neat way to think about plots.
It decomposes graphs into specific elements (layers), see the illustration below.

Of these layers, only the bottom three are indispensable; the ones above are for customization.
Geometries are plot types, see the poster below (link on GitHub).
To find inspiration, you can also type “chart chooser” in Google…

Let’s illustrate this with a few examples. Below, we show information along 4 dimensions:

  • x-axis
  • y-axis
  • color
  • shape of point

wb_data |>
  filter(region != "Aggregates") |>
  group_by(region, income) |>
  summarise(avg_wealth = mean(gdp_percap,  na.rm = T),
            avg_educ = mean(educ_spending, na.rm = T)) |>
  na.omit() |>
  ggplot(aes(x = avg_educ, y = avg_wealth, color = region, shape = income)) + 
  geom_point(size = 5) +
  xlab("Education spending") + ylab("Wealth") +
  theme_classic()

Other potential dimensions include:

  • alpha (transparency)
  • size (for points)
  • fill (the inside of shapes/rectangles)
  • linetype (for lines)
  • linewidth (for lines)

Look at the %in% operator below… much better than a chain of ORs (|).

wb_data |>
  filter(country %in% c("France", "Italy", "United Kingdom")) |>
  ggplot(aes(x = year, y = pop/10^6, color = country)) + 
  geom_line() + geom_point() + 
  theme_classic() + 
  theme(axis.title = element_blank(),
        text = element_text(size = 13),
        legend.text = element_text(size = 13),
        legend.title = element_text(face = "bold", size = 15),
        legend.position = c(0.2,0.8))

A closer look at the syntax:

\underbrace{\text{ggplot}}_{\substack{\text{function} \\ \text{call}}}\left(\underbrace{\text{aes}(\text{x = } \overbrace{\text{year}}^{\substack{\text{column} \\ \text{name}}}, \ \text{y = } \overbrace{\text{population}}^{\substack{\text{column} \\ \text{name}}}, \ \text{color = } \overbrace{\text{country}}^{\substack{\text{column} \\ \text{name}}})}_{\text{aesthetics}}\right) \ + \ \underbrace{\text{geom\_line()}}_{\substack{\text{geometry} \\ \text{(graph type)}}}

Both elements, the aesthetics and the geom type, are crucial.
Also, mind the “+” that separates the layers: it is not a pipe!

Another example, using a recycled pivot table.
na.omit() removes rows with missing data.

wb_data |>
  group_by(income) |>
  summarise(avg_gdp_percap = mean(gdp_percap, na.rm = T)) |>
  na.omit() |>
  ggplot(aes(y = reorder(income, avg_gdp_percap), x = avg_gdp_percap)) + 
  geom_col() +
  xlab("Average GDP per capita") + ylab("") + 
  theme_light()

A last one for the road.

wb_data |>
  filter(region != "Aggregates") |>
  group_by(region, year) |>
  summarise(avg_gdp_percap = mean(gdp_percap, na.rm = T)) |>
  ggplot(aes(x = year, y = avg_gdp_percap, color = region)) + 
  geom_line(linewidth = 0.9) + 
  scale_y_log10() + 
  theme_classic() + 
  theme(axis.title = element_blank(),
        text = element_text(size = 13)) +
  ggtitle("GDP per capita (log-scale)")

4 Other datasets

4.1 Airline traffic

Let’s have a look at some data. Below, we dive into airline traffic, obtained from Aéroport de Paris.1

Upon inspection, we need to open the third sheet of the workbook and focus on the first three columns:

  • date
  • number of passengers for CDG
  • number of passengers for ORY

This file was manually edited to avoid errors (duplicated dates in the original sample).

url <- "https://github.com/shokru/coqueret.github.io/raw/refs/heads/master/files/misc/time_series/traffic-sheet.xlsx"
airline <- read.xlsx(url, sheet = 3, startRow = 2)
head(airline)
     X1 Paris.-.Charles.de.Gaulle Paris.-.Orly   Total
1 36526                   3223328      1935261 5158589
2 36557                   3289676      1942750 5232426
3 36586                   3891206      2204640 6095846
4 36617                   4221430      2266448 6487878
5 36647                   4217758      2203570 6421328
6 36678                   4279344      2190218 6469562
  Paris.-.Charles.de.Gaulle Paris.-.Orly Total
1                     39489        19956 59445
2                     38386        19347 57733
3                     42049        21286 63335
4                     42931        20192 63123
5                     43909        20921 64830
6                     41966        19331 61297

Next, let’s do a bit of wrangling.

air_data <- airline |> select(1:3)                  # Keep only the first 3 columns
colnames(air_data) <- c("date", "CDG", "ORY")       # Rename these columns
air_data <- air_data |>                             # Reformat the date
  mutate(date = as.Date(as.numeric(date), origin = "1899-12-30")) |>
  filter(is.finite(date), is.finite(CDG)) |>
  arrange(date) 

To make the most of the {fable} package, we need to embed the dataframe into a tsibble, a kind of strange animal (data format) that the package really likes.

air_data <- air_data |>
  mutate(date = yearmonth(date)) |>
  distinct(date, .keep_all = T) |>
  as_tsibble() |>
  fill_gaps()

Let’s plot all of this!

air_data |> 
  pivot_longer(-date, names_to = "airport", values_to = "passengers") |>
  ggplot(aes(x = date, y = passengers/10^6, color = airport)) + geom_line() +
  theme_classic() + ggtitle("Passengers in millions") +
  theme(axis.title = element_blank(),
        title = element_text(face = "bold"),
        legend.position = c(0.1,0.9))

What are some patterns that we can identify?

air_data %>% 
  model(classical_decomposition(CDG)) %>% 
  components() %>% 
  autoplot() + theme_light()

A look at seasonality.

air_data |>
  as.data.frame() |>
  mutate(month = month(date)) |>
  group_by(month) |>
  summarise(avg_pass = mean(CDG/10^6)) |>
  ggplot(aes(x = as.factor(month), y = avg_pass)) + geom_col() +
  xlab("month") + ylab("")

Summer months, as expected, have more passenger rotations.

4.2 Atmospheric CO2 concentration

url <- "https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_mlo.csv"
co2 <- read.csv(url, skip = 40)
co2 <- co2 |> 
  mutate(date = make_date(year = year, month = month, day = 15)) |>
  select(date, average, deseasonalized) |>
  mutate(date_iso = date,
         date = yearmonth(date)) |>
  as_tsibble(index = date)

A sneak peek…

co2 |> 
  pivot_longer(-c(date, date_iso), names_to = "series", values_to = "value") |>
  ggplot(aes(x = date, y = value, color = series)) + geom_line() +
  theme_classic() + ggtitle("CO2 concentration (ppm)") +
  theme(legend.position = c(0.2,0.8),
        text = element_text(size = 15),
        legend.title = element_blank(),
        axis.title = element_blank())

How can we explain the oscillations?
Let’s decompose this.

co2 %>% 
  model(classical_decomposition(average)) %>% 
  components() %>% 
  autoplot() +
  theme_light()

4.3 Financial series

4.3.1 Stocks

tickers <- c("AAPL", "BA", "C", "PFE", "WMT", "XOM")  
# Tickers: Apple, Boeing, Citigroup, Pfizer, Walmart, Exxon
# Others: , "F", "DIS", "GE", "CVX", "MSFT", "GS"
min_date <- "2000-01-01"                      # Starting date
max_date <- "2025-10-30"                      # Ending date
prices <- getSymbols(tickers, src = 'yahoo',  # The data comes from Yahoo Finance
                     from = min_date,         # Start date
                     to = max_date,           # End date
                     auto.assign = TRUE, 
                     warnings = FALSE) %>% 
  map(~Ad(get(.))) %>%                        # Retrieving the data
  reduce(merge) %>%                           # Merge in one dataframe
  `colnames<-`(tickers)                       # Set the column names
prices |> tail()
             AAPL     BA      C   PFE    WMT    XOM
2025-10-22 258.45 216.59  96.30 24.72 107.14 114.71
2025-10-23 259.58 217.77  96.69 24.67 106.86 115.98
2025-10-24 262.82 221.35  98.78 24.76 106.17 115.39
2025-10-27 268.81 223.00 100.99 24.77 104.47 115.94
2025-10-28 269.00 223.33 101.39 24.50 103.17 115.03
2025-10-29 269.70 213.58  99.12 24.29 102.46 116.45
prices %>%
  as.data.frame() %>%
  rownames_to_column(var = "Date") %>%
  mutate(Date = as.Date(Date)) |>
  pivot_longer(-Date, names_to = "Asset", 
               values_to = "Price") %>%
  ggplot(aes(x = Date, y = Price, color = Asset)) + geom_line() + 
  facet_wrap(vars(Asset), scales = "free") + theme_light()

Any pattern that you can recognize?

4.3.2 Cryptocurrencies

We will use the {crypto2} package below; if need be, install it.
First, let’s have a look at the list of available coins… which are numerous!

coins <- crypto_list() 

You can have a look at the info for the coins via the code below.

c_info <- crypto_info(coin_list = coins, limit = 30)
❯ Scraping crypto info
❯ Processing crypto info

Next, we can download historical quotes.
Since symbols have duplicates, we need to use “slugs” instead.

coin_symb <- c("bitcoin", "ethereum", "tether", "xrp")
coin_hist <- crypto_history(coins |> dplyr::filter(slug %in% coin_symb),
                            start_date = "20170101",
                            end_date = "20250925")
❯ Scraping historical crypto data
❯ Processing historical crypto data
coin_hist <- coin_hist |>  # Timestamps are at midnight.
  mutate(date = as.Date(as.POSIXct(timestamp, origin="1970-01-01")))

Mind the log-scale!

coin_hist |>
  ggplot(aes(x = date, y = close, color = name)) + geom_line() +
  scale_y_log10() + theme_bw() + scale_color_d3() +
  theme(legend.position = c(0.75,0.5),
        axis.title = element_blank(),
        legend.title = element_blank()) 

Any pattern that you can recognize here again?

Unfortunately, this kind of data is hard to analyze directly - more on that next time!

5 Descriptive statistics (to be done in class)

Pick a dataset… & explore!

  1. What is the average rate of increase of CO2 in the atmosphere?
  2. Using the code below, determine during which month airports see the most passengers.
  3. Which stock/crypto had the best performance over the sample we downloaded?
  4. Which country is the richest (GDP per capita), and which one spends the most on R&D, or education?

air_data |> mutate(month = month(date)) 

6 Wrap-up

The things you need to remember:

  • the tidyverse functions: filter, select, arrange, mutate, group_by and summarize. They are incredible tools to analyze data rapidly.
  • There are a small number of patterns that are easy to recognize (seasonality, trends). But there are also unpredictable shocks, some small, some very large. It is these shocks that we will try to model henceforth.

Footnotes

  1. See at the bottom of the page, downloaded in October and covering 2000-2025.↩︎